Fix read from multiple s3 regions #1453
Conversation
Hey @kevinjqliu, I hope you are ready for the Christmas time :-) After some investigation, I noticed the [...]. Would be keen to get some quick feedback, and I will add more unit tests if this sounds like a fix on the right track? Thanks!
@jiakai-li Thanks for working on this! And happy holidays :)
Looking through the usage for [...], I think that's one of the problems we need to tackle. The current S3 configuration requires a specific "region" to be set. This assumes that all data and metadata files are in the same region as the one specified. But what if I have some files in one region and some in another? I think a potential solution might be to omit the "region" property and allow the S3FileSystem to determine the proper region using [...]

Another potential issue is the way we cache the fs: it assumes that there's only one fs per scheme. With the region approach above, we break this assumption.
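To make that idea concrete, here is a minimal sketch (the bucket name and path are hypothetical, not from this PR) of letting pyarrow resolve a bucket's region per bucket instead of relying on one configured region:

```python
from pyarrow.fs import S3FileSystem, resolve_s3_region

# Hypothetical bucket; resolve its actual region, then build a filesystem pinned to it.
bucket = "example-bucket-us-east-1"
region = resolve_s3_region(bucket)  # one HTTP round trip to S3, cacheable per bucket
fs = S3FileSystem(region=region)
info = fs.get_file_info(f"{bucket}/path/to/data.parquet")
```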
BTW there's a similar issue in #1041.
Thank you @kevinjqliu, just trying to clear my head a little bit.

Is the change I made in accordance with this option? What I've done essentially is using the [...]

Please correct me if I'm missing something about how the fs cache works, but here is my understanding: I see we use [...]. I think solving the [...]
Can I tackle this issue as well if there is no one working on it?
I don't think [...]

iceberg-python/pyiceberg/io/pyarrow.py, lines 434 to 436 in dbcf65b

and running an example S3 URI: [...]

In order to support multiple regions, we might need to call [...]

BTW a good test scenario could be a table where the metadata files are stored in one bucket while the data files are stored in another. We might be able to construct this test case by modifying the [...]
I don't think anyone's working on it right now, feel free to pick it up.
Thank you @kevinjqliu, can I have some more guidance on this please?

I did some searching, and it seems that for the s3 scheme the format is [...]. In the example below, I would expect 'a' to be [...]
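(A sketch reconstructing that kind of example, not the original snippet: this is how Python's `urlparse` splits an S3 URI, with the bucket ending up in `netloc`.)

```python
from urllib.parse import urlparse

parsed = urlparse("s3://a/b/c")
print(parsed.scheme)  # s3
print(parsed.netloc)  # a   <- the bucket
print(parsed.path)    # /b/c
```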
Yep, I tested the change using a similar scenario locally with my own handcrafted s3 files. But will add more proper test cases as I make more progress. Thanks again!
ah yes, you're right. sorry for the confusion. I was thinking of something else.
BTW there are 2 FileIO implementations, one for pyarrow, another for fsspec. We might want to do the same for fsspec:

iceberg-python/pyiceberg/io/fsspec.py, lines 133 to 141 in dbcf65b
Sweet, I'll go ahead with this approach then. Thanks very much @kevinjqliu!
Hi @kevinjqliu, for the above concern, I tested it locally and also did some investigation. According to what I found here, it seems fsspec doesn't have the same issue as pyarrow. So I guess we can leave it?
Wow, that's interesting, I didn't know about that. I like that solution :) Hopefully pyarrow fs will have this feature one day.
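(For context, a minimal sketch of the fsspec/s3fs behavior being discussed; the bucket and key are hypothetical.)

```python
import s3fs

# With cache_regions=True, s3fs discovers each bucket's region on first access and
# caches it, so a missing or mismatched default region doesn't break reads the way it
# does with pyarrow's S3FileSystem.
fs = s3fs.S3FileSystem(cache_regions=True)
with fs.open("s3://example-bucket/data/part-0.parquet", "rb") as f:
    header = f.read(4)
```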
some more comments, thanks for working on this!
Co-authored-by: Kevin Liu <[email protected]>
pyiceberg/io/pyarrow.py (outdated):

```diff
@@ -1508,7 +1512,7 @@ def _record_batches_from_scan_tasks_and_deletes(
         if self._limit is not None and total_row_count >= self._limit:
             break
         batches = _task_to_record_batches(
-            self._fs,
+            _fs_from_file_path(task.file.file_path, self._io),
```
nice! I think this solves #1041 as well.
Yep, it is :-)
Nit: I noticed that we first pass in the path here, and the IO as a second argument. For `_read_all_delete_files` it is the other way around. How about unifying this?
This PR is ready for review now. Thanks very much and merry Christmas! Please let me know if any further change is required.
pyiceberg/io/pyarrow.py (outdated):

```diff
@@ -362,6 +362,12 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
             "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
         }

+        # Override the default s3.region if netloc (bucket) resolves to a different region
+        try:
+            client_kwargs["region"] = resolve_s3_region(netloc)
```
What about doing this lookup only when the region is not provided explicitly? I think this will do another call to S3.
Thank you Fokko, my understanding is that the problem occurs when the provided region doesn't match the data file's bucket region, and that fails the file read for pyarrow. By overriding with the bucket region (falling back to the provided region), we make sure the real region a data file is stored in takes precedence. (This function is cached when using `fs_by_scheme`, so it will only be called for new buckets that haven't been resolved previously, to save calls to S3.)
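(For illustration, a simplified, self-contained sketch of that per-bucket caching; this is not the actual pyiceberg code, which builds the filesystem from the FileIO properties.)

```python
from functools import lru_cache

from pyarrow.fs import FileSystem, S3FileSystem, resolve_s3_region


# The filesystem (and the bucket-region lookup inside it) is built once per
# (scheme, bucket) pair and reused for every later file in that bucket.
@lru_cache(maxsize=None)
def fs_by_scheme(scheme: str, netloc: str) -> FileSystem:
    return S3FileSystem(region=resolve_s3_region(netloc))
```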
I think there are these 3 cases we're worried about:
```
# region match
region=us-east-1
s3://foo-us-east-1/
s3://bar-us-east-1/

# region mismatch
region=us-west-2
s3://foo-us-east-1/
s3://bar-us-west-2/

# region not provided
region=None
s3://foo-us-east-1/
s3://bar-us-west-2/
```
We have a few options here:

1. use `region` when provided, fall back to `resolve_s3_region`
2. always use `resolve_s3_region`
3. use `resolve_s3_region`, fall back to `region`
Option 1 is difficult since we don't know that the provided `region` is wrong until we try to use the FileIO.

The code above uses option 2, which will always make an extra call to S3 to get the correct bucket region. This extra call to S3 is cached though, so `resolve_s3_region` is only called once per bucket. This is similar to the `cache_regions` option for `s3fs.core.S3FileSystem`.

I like option 3: we can resolve the bucket region and fall back to the provided `region`. It might be confusing to the end user when a `region` is specified but the FileIO uses a different region, so let's add a warning for that.
Something like this
```python
# Attempt to resolve the S3 region for the bucket, falling back to configured region if resolution fails
# Note, bucket resolution is cached and only called once per bucket
provided_region = get_first_property_value(self.properties, S3_REGION, AWS_REGION)
try:
    bucket_region = resolve_s3_region(bucket=netloc)
except (OSError, TypeError):
    bucket_region = None
    logger.warning(f"Unable to resolve region for bucket {netloc}, using default region {provided_region}")

if bucket_region and bucket_region != provided_region:
    logger.warning(
        f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: "
        f"provided region {provided_region}, actual region {bucket_region}"
    )

region = bucket_region or provided_region

client_kwargs: Dict[str, Any] = {
    "endpoint_override": self.properties.get(S3_ENDPOINT),
    "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
    "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
    "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
    "region": region,
}
```
Thanks for elaborating on this, I want to make sure that the user is aware of it, and I think we do that right with the warning.
For some additional context, for Java we don't have this issue because when you try to query the wrong region, the AWS SDK returns an HTTP 301 to the correct region. This introduces another 200 call but that's okay. The PyArrow implementation (that I believe uses the AWS C++ SDK underneath), just throws an error that it got a 301. We saw that in the past for example here: #515 (comment).
Another round of comments, thanks for working on this!
pyiceberg/io/pyarrow.py (outdated):

```diff
         if scheme in {"s3", "s3a", "s3n", "oss"}:
-            from pyarrow.fs import S3FileSystem
+            from pyarrow.fs import S3FileSystem, resolve_s3_region
```
nit: since the `oss` scheme uses this path, does it also support `resolve_s3_region`?
Thank you @kevinjqliu. This is a really good catch. I didn't find too much information regarding support of oss regions by `pyarrow.fs.resolve_s3_region`, but I tried it on my end and it doesn't seem to work: it throws an error complaining that the bucket cannot be found.

This could be a problem though, especially if the same bucket name is used by both Aliyun and AWS, in which case the user-provided bucket region for Aliyun could be wrongly overwritten (by the resolved AWS one).

I separated the `oss` path from `s3` for now, as I'm not sure if we want to tackle `oss` now (and I feel we probably want to treat the two protocols differently?). I also broke the `_initialize_fs` code chunk into smaller blocks to make it a bit easier for future modification.
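(As a rough illustration of that split; the signatures are simplified and this is not the PR's actual code, which takes the full set of FileIO properties.)

```python
from typing import Optional

from pyarrow.fs import FileSystem, S3FileSystem, resolve_s3_region


def _initialize_oss_fs(netloc: Optional[str], configured_region: Optional[str]) -> FileSystem:
    # oss keeps the user-configured region untouched; no resolve_s3_region call here
    return S3FileSystem(region=configured_region)


def _initialize_s3_fs(netloc: Optional[str], configured_region: Optional[str]) -> FileSystem:
    # plain s3 schemes resolve the bucket's actual region
    return S3FileSystem(region=resolve_s3_region(netloc))


def _initialize_fs(scheme: str, netloc: Optional[str] = None, configured_region: Optional[str] = None) -> FileSystem:
    if scheme in {"oss"}:
        return _initialize_oss_fs(netloc, configured_region)
    elif scheme in {"s3", "s3a", "s3n"}:
        return _initialize_s3_fs(netloc, configured_region)
    raise ValueError(f"Unrecognized filesystem scheme: {scheme}")
```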
> I didn't find too much information regarding support of oss regions by `pyarrow.fs.resolve_s3_region`. But I tried it on my end and it doesn't seem to work as it throws me an error complaining the bucket cannot be found.
I don't think it's supported; the underlying call looks for `x-amz-bucket-region`, which I don't think Aliyun will set:
https://github.com/apache/arrow/blob/48d5151b87f1b8f977344c7ac20cb0810e46f733/cpp/src/arrow/filesystem/s3fs.cc#L660
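(A rough sketch of how that header-based lookup works, not Arrow's actual C++ code; the bucket name is hypothetical. S3 reports the region in the `x-amz-bucket-region` header even on 301/403 responses, and Aliyun OSS doesn't send this header.)

```python
import urllib.error
import urllib.request

request = urllib.request.Request("https://example-bucket.s3.amazonaws.com", method="HEAD")
try:
    region = urllib.request.urlopen(request).headers.get("x-amz-bucket-region")
except urllib.error.HTTPError as err:
    # the header is still present on error responses such as 301 or 403
    region = err.headers.get("x-amz-bucket-region")
print(region)
```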
> This could be a problem though, especially if the same bucket name is used by both Aliyun and AWS. In which case the user-provided bucket region for Aliyun could be wrongly overwritten (by the resolved AWS one).

Since we're using both scheme and bucket to cache the FS, this should be fine, right? For the case of `oss://bucket` and `s3://bucket`.

> I separate the oss path from s3 for now as I'm not sure if we want to tackle oss now (and I feel we probably want to treat the two protocols differently?). I also break the _initialize_fs code chunk into smaller blocks to make it a bit easier for future modification.

Yeah, let's just deal with s3 for now. BTW fsspec splits the constructor per fs, I think it looks pretty clean.
> I don't think it's supported, the underlying call is looking for `x-amz-bucket-region`, which I don't think Aliyun will set

Thank you for checking that, I should have looked at it :-)

> since we're using both scheme and bucket to cache FS, this should be fine right? For the case of `oss://bucket` and `s3://bucket`

Yes, there is no issue after the change now. What I was thinking of is the `oss://bucket` scenario (ignoring the caching behavior): if the bucket used by `oss` also exists in AWS, then the previous version (before your comment) would try to resolve the bucket and use it to override the default setting. This would be wrong though, because the `oss` bucket region cannot be resolved using pyarrow.
I updated the test case to take this into account and also added an integration test for multiple filesystem read.
LGTM! Thank you for working on this
Something's going on with the GitHub runner [...]

I saw that as well. Seems the [...]

#1485 to replace [...]
Sorry for leaving this hanging. I wanted to do some local checks to ensure that the caching works properly. The Arrow FS is pretty bulky, and we had some issues with the caching in the past which caused some performance regression.
I left some small comments, but this looks good to me 👍
pyiceberg/io/pyarrow.py (outdated):

```diff
@@ -190,13 +190,6 @@
 T = TypeVar("T")


-class PyArrowLocalFileSystem(pyarrow.fs.LocalFileSystem):
```
People could rely on this for other things, I don't think we can just remove this without deprecation.
Let's revert this change, I don't like the inline declaration of the class.
pyiceberg/io/pyarrow.py (outdated):

```python
            if proxy_uri := self.properties.get(S3_PROXY_URI):
                client_kwargs["proxy_options"] = proxy_uri
        elif scheme in ("hdfs", "viewfs"):
```
Let's be consistent here:
elif scheme in ("hdfs", "viewfs"): | |
elif scheme in {"hdfs", "viewfs"}: |
```python
        bucket_region = bucket_region or provided_region
        if bucket_region != provided_region:
            logger.warning(
                f"PyArrow FileIO overriding S3 bucket region for bucket {netloc}: "
                f"provided region {provided_region}, actual region {bucket_region}"
            )
```
I like this one, thanks!
pyiceberg/io/pyarrow.py (outdated):

```python
        if scheme in {"oss"}:
            return self._initialize_oss_fs(scheme, netloc)
```
Do we know if Alibaba doesn't support this?
I didn't find an authoritative document explicitly saying it's not supported by pyarrow, but I tested it locally and it doesn't work for Alibaba. Kevin also helped check the pyarrow code in this comment: pyarrow seems to use the `x-amz-bucket-region` header to determine the bucket region, which seems to be an AWS-specific thing.
```python
            deprecation_message(
                deprecated_in="0.8.0",
                removed_in="0.9.0",
                help_message=f"The property {GCS_ENDPOINT} is deprecated, please use {GCS_SERVICE_HOST} instead",
            )
```
Should we remove this one, while at it?
Thank you Fokko, do you mean to remove the `deprecation_message` or the `GCS_ENDPOINT` property? It says this option will be removed in 0.9.0, is it OK if we remove it now?
We can remove it now since the next release using the `main` branch will be `0.9.0`. But I'd prefer to remove it in a separate PR since there are also references to `GCS_ENDPOINT` in fsspec:
https://grep.app/search?q=GCS_ENDPOINT&filter[repo][0]=apache/iceberg-python
pyiceberg/io/pyarrow.py (outdated):

```python
                "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
            }
        elif scheme in {"s3", "s3a", "s3n"}:
            return self._initialize_s3_fs(scheme, netloc)
```
Many of the methods don't require all the parameters; for example, `_initialize_s3_fs` does not use the `scheme`. Should we remove those?
LGTM! Looks like CI is passing now too.
Co-authored-by: Kevin Liu <[email protected]>
Thank you @jiakai-li for the contribution and @Fokko for the review :)
This PR closes: